Introduction

Out of all the advantages of attending UC Santa Barbara, one of the most often cited is the lovely weather year-round in Goleta and SB. This stable, temperate climate is created by a variety of factors, not least of all the coastal location and the tall Santa Ynez mountains in the north acting as a sort of shield against weather conditions further inland. Evaporated moisture from the sea is trapped by the mountains, creating a humid yet pleasant climate for residents of this small city. Frequently, the humidity and other weather conditions result in a haze blanketing the entire Goleta/SB area, while other days the marine layer takes the place of this haze covering the sky.

When I began searching for a topic for this project, I knew that I wanted to apply my data science skills to a real-world problem that was personal to me, and weather prediction was what I settled on. I searched for weather datasets all over the internet, but unfortunately the NOAA did not have complete data records, and hundreds if not thousands of observations were missing from every dataset I downloaded. However, I finally found a good dataset to work with.

Data Source

After several weeks of searching, I discovered that Weather Underground had complete historical records of daily temperature, humidity, wind speed, dew point, and air pressure for the past several years. The data was recorded at Santa Barbara Municipal Airport. Since WU had no CSV or XLSX files I could download, I manually copy-pasted the data tables into Microsoft Excel for every month from May 2015 to May 2022 and converted the file to a CSV. I cleaned the data by hand as well — there were 2557 observations in total, of which 27 were completely missing. I assumed that these were days when the weather station was out of service. In the end, I now have 2530 observations with which to predict humidity. With that, let’s take a look at the dataset itself.

Exploratory Data Analysis

Before anything else, we must load all of our packages. You can see the full list of packages I am using below.

library(tidymodels)
library(ISLR)
library(ISLR2)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR)
library(dplyr)
library(ggplot2)
library(glmnet)
library(parsnip)
library(janitor)
library(corrplot)
library(randomForest)
library(rpart)
library(rpart.plot)
library(regclass)
library(broomstick)



Reading in Data

Now that that’s out of the way, it’s time to read in my data. I used clean_names() from the janitor package to make my data easier to wrangle. I will also make sure R can read my date values as dates rather than characters, and add a year_month variable in order to calculate monthly averages. This makes some of my data, such as precipitation, temperature, and humidity, far easier to plot.

weather <- read.csv("weatherdatanew.csv") %>% clean_names()
# We clean the variable names to make the code easier to write
weather$date <- as.Date(weather$date, format="%m/%d/%y")
# Cleaning date values
weather <- weather %>% mutate(year_month=format(date, "%y/%m"), .before=date)
# Adding year_month variable in order to create monthly averages for weather values



Correlation Matrix

Now it’s time to start my EDA. Below, I created a correlation matrix of all of my variables in order to highlight important relationships between the predictors and response.

weather %>% 
  dplyr::select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color')

# Correlation matrix of all numeric variables in order to ascertain which correlations stand out


The correlation matrix has some interesting implications - while some of the relationships were anticipated, such as the correlation between average wind speed and maximum wind speed, some others were less obvious. In particular, the moderate negative correlation between average humidity and maximum wind speed and the strong positive correlation between average dew point and humidity stood out to me. Thinking about it, these relationships make sense. High wind speed means that water molecules in the air are spread further apart from each other, so it’s harder for them to condense - thus, dew point is lower. Average dew point is also related to humidity in that dew point is the air temperature at which relative humidity would be 100% - water then begins to condense into dew. If humidity is high, it makes sense that the dew point would not have to be as low for relative humidity to reach 100% and for water to start condensing.

Exploring Individual Variables

I’d like to find out if there are any patterns in my data, so I’ve plotted several line graphs below, each exploring one of the variables.

Average Monthly Humidity

humidity_avg <- aggregate(average_humidity ~ year_month, weather, mean)
# Aggregate means of monthly humidity in a new data frame
h <- ggplot(data=humidity_avg, aes(x=year_month, y=average_humidity, group=1)) + 
  geom_line() + labs(y="Average Humidity", x="Year/Month")  
# Store plot for monthly average humidity
h + theme(axis.text.x = element_text(angle = 90))

# Show plot while rotating x-axis values 90 degrees to prevent crowding


While this graph is somewhat rough, it still makes sense - humidity tends to peak most in the warmest months, as increased sunlight and heat causes more water to evaporate from the ocean and flora/fauna on land.

Average Monthly Temperature/Dew Point

temp_avg <- aggregate(average_temperature_o_f ~ year_month, weather, mean)
# Aggregate means of monthly temp in a new data frame
dew_avg <- aggregate(average_dew_point_o_f ~ year_month, weather, mean)
# Aggregate means of monthly dew point in a new data frame

total1 <- merge(humidity_avg, temp_avg, by="year_month")
total <- merge(total1, dew_avg, by="year_month")
ggplot(data=total, aes(x=year_month)) + 
  geom_line(aes(y = average_temperature_o_f, group=1, color="Temperature")) + 
  geom_line(aes(y = average_dew_point_o_f, group=1, color="Dew Point")) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(y="Degrees Fahrenheit", x="Year/Month")


As expected, monthly average temperature and dew point closely follow each other. As temperatures increase, the temperature at which water condenses from the air grows higher, as there is more water vapor in the air. This correlates well with the humidity graph as well, which has peaks in warmer months.



Total Monthly Precipitation

precip_month <- aggregate(precipitation_in ~ year_month, weather, sum)

ggplot(data=precip_month, aes(x=year_month)) + geom_line(aes(y=precipitation_in, group=1)) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(y="Total Precipitation", x="Year/Month")


This graph is hardly surprising. Precipitation peaks regularly between November and April every year, with the height of the peaks varying depending on whether a drought is currently in effect as well as other environmental factors.



Daily Average Pressure

ggplot(data=weather, aes(x=date)) + geom_point(aes(y=average_pressure_in_hg, group=1)) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(y="Average Pressure (inHg)", x="Date")


Since Santa Barbara Airport is very close to sea level, pressure stays very stable in the area. Still, some peaks and troughs are visible in this scatterplot, though they are shallow. Notably, four points stand out as outliers below the graph - I imagine these are recording errors, or days of exceptionally low pressure prior to a storm. There are a few other points that stick out slightly below and above the main line of the scatterplot, but they seem to be legitimate high- or low-pressure days rather than recording errors.

Daily Average and Maximum Wind Speed

ggplot(data=weather, aes(x=date)) + geom_line(aes(y=average_wind_speed_mph, group=1, color="Average Wind Speed")) +
  geom_line(aes(y=maximum_wind_speed_mph, group=1, color="Maximum Wind Speed")) +
  theme(axis.text.x = element_text(angle=90)) + labs(y="Miles per Hour", x="Date")


Interestingly, average and maximum wind speed both dip every year around December/January. I’m not sure exactly why this is, but I believe it has to do with the fact that air pressure tends to stay low around that time of year, so there is not much movement of air.



Modeling

Now it’s time to model the data. I’m choosing four models: a regression decision tree, a random forest with bagging, a boosted trees model, and a linear regression model.

Initial split

Since my dataset is relatively small (2530 observations), I’m choosing a 75-25 training-testing split and only creating 5 cross-validation folds with 5 repeats. I’m stratifying on the outcome variable as well.

set.seed(3945)
weather_split <- initial_split(weather, prop=0.75, strata=average_humidity)
weather_train <- training(weather_split)
weather_test <- testing(weather_split)
weather_fold <- vfold_cv(weather_train, v=5, strata=average_humidity, repeats=5)



Recipe creation

Date variables are tiresome to integrate into our models, especially considering they do not fit neatly into numeric values, so we will leave them out when creating our recipe.

weather_recipe <- recipe(average_humidity ~ average_temperature_o_f + average_dew_point_o_f + maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in, data=weather_train)



Model creation


Regression Tree


First we fit a regression decision tree. We are tuning cost_complexity and tree_depth, then choosing the best model out of a tuning grid with 5 levels. After selecting the best decision tree, we will plot it so we can visualize it.

set.seed(3945)
regtreespec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

regtreewf <- workflow() %>%
  add_model(regtreespec %>% set_args(cost_complexity = tune(), tree_depth = tune())) %>%
  add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)

param_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)

tune_res <- tune_grid(
  regtreewf,
  resamples = weather_fold, 
  grid = param_grid
)

bestcomplexity <- select_best(tune_res, metric = "rmse")

regtreefinal <- finalize_workflow(regtreewf, bestcomplexity)

regtreefinalfit <- fit(regtreefinal, data = weather_train)

getcp(extract_fit_engine(regtreefinalfit))

regtreefinalpruned <- prune(extract_fit_engine(regtreefinalfit), cp=2.225653e-05)

save(regtreespec, regtreewf, param_grid, tune_res, bestcomplexity, regtreefinal, regtreefinalfit, regtreefinalpruned,  file="regtree_models.RData")


I used the getcp() function to find the tree that would result in the lowest level of cross-validation error, that being this tree. Now we can look at our decision tree: it’s extremely large, with dozens of nodes.
A zoomed-out screenshot of the overall decision tree
This looks like it might be a bit overfitted, but we’ll see how it turns out when we try to predict our testing data. A detailed image of the whole tree is available at the link in DecisionTree.txt. Note that the file is about 1.09 GB and so will take some time to load - you will need to zoom in to view the whole tree. One main thing to note is that the tree model only chose two predictors: average_temperature_o_f and average_dew_point_o_f. Evidently, these two predictors are the most relevant to humidity, with other factors not predicting it as accurately.

Let’s look at how well this tree predicts the testing set.

weather_test %>%
  mutate(
    .pred=predict(regtreefinalpruned, weather_test)
  ) %>% 
  rmse(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.56


Our tree is actually surprisingly accurate! Using the cp value provided by getcp(), we find that the root mean square error is 2.56%. Considering that the humidity ranges from 27.4% to 99.5%, this is actually a pretty good margin of error. Our decision tree is fairly accurate.

We can also see a scatterplot of the actual data points compared to the predicted values:

weather_test %>%
  mutate(
    .pred=predict(regtreefinalpruned, weather_test)
  ) %>%
  ggplot(aes(average_humidity, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)


It looks like the points follow the line pretty well! Interestingly, the points start to get closer to the line as the percentages grow. The error grows larger the smaller the percentage. This is likely because Santa Barbara is located on the coastline, so humidity is quite stable in the 70-90% range. Because of this stability, most of the data points are located in that range as well, giving the model more data with which to predict humidity. Lower humidity levels are comparatively rare, so a lot more “noise” appears in the model.

Random Forest with Bagging


Next we will fit a Random Forest model with bagging. We set mtry to 6 since there are 6 predictors.

set.seed(3945)

bagging_spec <- rand_forest(mtry = 6) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

baggingwf <- workflow() %>%
  add_model(bagging_spec %>% set_args(min_n = tune())) %>%
  add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)

param_grid_bag <- grid_regular(min_n(), levels = 5)

tune_res_bag <- tune_grid(
  baggingwf,
  resamples = weather_fold, 
  grid = param_grid_bag
)

bestcomplexitybag <- select_best(tune_res_bag, metric = "rmse")

bagfinal <- finalize_workflow(baggingwf, bestcomplexitybag)

bagfinalfit <- fit(bagfinal, data = weather_train)


Now let’s see how this model performs on the testing set.

augment(bagfinalfit, new_data = weather_test) %>%
  rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        1.45


The RMSE for this model is significantly lower than for the Decision Tree model - we have an RMSE of 1.45% as compared to 2.56%, a full 1.11% lower.

Boosted Tree

Now we fit a Boosted Tree model. We set trees to 5000 and mtry to 6 like the previous model, then tune tree_depth and min_n, using the xgboost engine.

set.seed(3945)

boost_spec <- boost_tree(trees = 5000, learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boostwf <- workflow() %>%
  add_model(boost_spec) %>%
  add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)
  
param_grid_boost <- grid_regular(learn_rate(range=c(-10, -1)), levels = 5)

tune_res_boost <- tune_grid(
  boostwf,
  resamples = weather_fold, 
  grid = param_grid_boost
)

bestcomplexityboost <- select_best(tune_res_boost, metric = "rmse")

boostfinal <- finalize_workflow(boostwf, bestcomplexityboost)

boostfinalfit <- fit(boostfinal, data = weather_train)


Let’s see how the Boosted Tree model performs on the testing data.

augment(boostfinalfit, new_data = weather_test) %>%
  rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        1.39


Our RMSE for this model is 1.39%, a bit lower than it was for the Random Forest model! This is 1.17% less than our Decision Tree RMSE.



Linear regression


Finally, we fit a Linear Regression model. This is straightforward compared to the other models.

set.seed(3945)

lm_model <- linear_reg() %>% 
  set_engine("lm")

lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(weather_recipe)

lm_fit <- fit(lm_wflow, weather_train)


Now let’s see the performance of the Linear Regression model on the testing set.

augment(lm_fit, new_data = weather_test) %>%
  rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.26

The RMSE for this model is 2.26% - though this is certainly on the higher side, it’s still 0.3% lower than the Decision Tree model.

Evaluating Model Performance

Now that we’ve calculated the RMSE for each model, we can see that the Boosted Trees model is the most accurate for predicting the testing data given the training set, with an RMSE of only 1.39%. The Random Forest with Bagging is the second most accurate at 1.45%, Linear Regression is third at 2.26%, and our Decision Tree is last at 2.56%. However, this is just for the RMSE metric. Let’s see how accurate each model is when the yardstick is R-squared and MAE (Mean Absolute Error) as well.

weather_test %>%
  mutate(
    .pred=predict(regtreefinalpruned, weather_test)
  ) %>% 
  rsq(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.967
augment(bagfinalfit, new_data = weather_test) %>%
  rsq(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.990
augment(boostfinalfit, new_data = weather_test) %>%
  rsq(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.990
augment(lm_fit, new_data = weather_test) %>%
  rsq(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.974
weather_test %>%
  mutate(
    .pred=predict(regtreefinalpruned, weather_test)
  ) %>% 
  mae(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard        1.93
augment(bagfinalfit, new_data = weather_test) %>%
  mae(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.970
augment(boostfinalfit, new_data = weather_test) %>%
  mae(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.956
augment(lm_fit, new_data = weather_test) %>%
  mae(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard        1.68

These are interesting patterns. When we use R-squared as our yardstick, we see that – while the Decision Tree still performs the worst and Linear Regression still performs second-worst – our Random Forest and Boosted Trees models are equally accurate in predicting the testing data. In addition, all of the models have an R-squared value of over 0.95. Knowing that, it seems like splitting hairs to compare them, but this is still useful information.

When we use MAE as our yardstick, we see the same pattern as the RMSE. The models in order of best to worst performance are Boosted Trees (0.956% MAE), Random Forest (0.97% MAE), Linear Regression (1.68% MAE), and Decision Tree (1.93% MAE).

Conclusion

Finally, I can conclude that the Boosted Trees model is the best model for predicting our weather data. Although the R-squared comparison showed that the Random Forest and Boosted Trees models were the same, The Mean Absolute Error and Root Mean Square Error comparisons both had Boosted Trees performing the best. Upon researching the reason for why this might be, I found that while Boosted Trees can be more prone to overfitting than Random Forests, they still perform better on average. Since we had 5 cross-validation folds with 5 repeats each, we can safely conclude that our models are safeguarded against overfitting. A question I would like to explore in the future would be whether I can predict all of the weather parameters, or at least most of them, using date and time of year as predictor variables. This project taught me a lot about modeling data using the models that are most appropriate for the type of data that I have. I also learned important lessons about gathering my data from reputable and consistent sources. I would like to replicate this project or work on a similar one in Python in the future to grow my capabilities in data modeling and machine learning.